Statistical Models for Presence-Only Data: Finite-Sample Equivalence and Addressing Observer Bias
Authors
Abstract
Statistical modeling of presence-only data has attracted much recent attention in the ecological literature, leading to a proliferation of methods, including the inhomogeneous Poisson process (IPP) model [15], maximum entropy (Maxent) modeling of species distributions [12] [9] [10], and logistic regression models. Several recent articles have shown the close relationships between these methods [1] [15]. We explain why the IPP intensity function is a more natural object of inference in presence-only studies than occurrence probability (which is only defined with reference to quadrat size), and why presence-only data allow estimation only of relative, not absolute, intensities. All three of the above techniques amount to parametric density estimation under the same exponential family model. We show that the IPP and Maxent models give exactly the same estimate for this density, but logistic regression in general produces a different estimate in finite samples. When the model is misspecified, logistic regression and the IPP may have substantially different asymptotic limits with large data sets. We propose “infinitely weighted logistic regression,” which is exactly equivalent to the IPP in finite samples. Consequently, many already-implemented methods extending logistic regression can also extend the Maxent and IPP models in directly analogous ways using this technique. Finally, we address the issue of observer bias, modeling the presence-only data set as a thinned IPP. We discuss when the observer bias problem can be solved by regression adjustment, and additionally propose a novel method for combining presence-only and presence-absence records from one or more species to account for it.
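The relationship between weighted logistic regression and the IPP can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes a single covariate, a log-linear intensity exp(alpha + beta * x), and background points that double as quadrature nodes for the IPP integral. The function names, the toy data, and the Newton solvers are all hypothetical choices made for illustration. Presence points get weight 1 and background points get weight W; as W grows, the logistic slope should approach the IPP/Maxent slope.

```python
import numpy as np


def fit_ipp_slope(x_pres, x_bg, n_iter=50):
    """Slope of a log-linear intensity, with the integral over the region
    approximated by an average over background points (an assumption).
    After profiling out the intercept, the fit matches the exponentially
    tilted background mean of x to the presence mean of x."""
    beta = 0.0
    for _ in range(n_iter):
        w = np.exp(beta * x_bg)
        w /= w.sum()                                  # tilted background weights
        mean_bg = np.sum(w * x_bg)
        var_bg = np.sum(w * (x_bg - mean_bg) ** 2)
        beta += (x_pres.mean() - mean_bg) / var_bg    # 1-D Newton step
    return beta


def fit_weighted_logistic_slope(x_pres, x_bg, W, n_iter=50):
    """Slope of a logistic regression with presence points labeled 1
    (weight 1) and background points labeled 0 (weight W), fit by Newton."""
    x = np.concatenate([x_pres, x_bg])
    y = np.concatenate([np.ones_like(x_pres), np.zeros_like(x_bg)])
    w = np.concatenate([np.ones_like(x_pres), np.full_like(x_bg, W)])
    X = np.column_stack([np.ones_like(x), x])
    # Start from the intercept-only fit so Newton begins near the solution.
    theta = np.array([np.log(len(x_pres) / (W * len(x_bg))), 0.0])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        theta = theta + np.linalg.solve(hess, grad)
    return theta[1]


# Toy, deterministic data (hypothetical): background grid and presence points.
x_bg = np.linspace(-1.0, 1.0, 200)
x_pres = np.linspace(-0.5, 1.0, 50)

beta_ipp = fit_ipp_slope(x_pres, x_bg)
for W in (1.0, 1e2, 1e6):
    beta_lr = fit_weighted_logistic_slope(x_pres, x_bg, W)
    print(f"W={W:g}  logistic slope={beta_lr:.6f}  gap to IPP={abs(beta_lr - beta_ipp):.2e}")
```

With W = 1 this is ordinary logistic regression, whose slope generally differs from the IPP slope on the same points; increasing W shrinks the gap. In practice the same effect could be obtained by passing case weights to an existing logistic regression routine, which is what makes the extensions mentioned in the abstract convenient.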
Similar resources
The point process use-availability or presence-only likelihood and comments on analysis.
1. Use-availability and presence-only analyses are synonyms. Both require two samples (one containing known locations, one containing potential locations), both estimate the same parameters, and both use the same fundamental likelihood. 2. Use-availability and presence-only designs compare characteristics of points where an organism was located to those where the organism could have been locate...
Bias properties of Bayesian statistics in finite mixture of negative binomial regression models in crash data analysis.
Factors that cause heterogeneity in crash data are often unknown to researchers and failure to accommodate such heterogeneity in statistical models can undermine the validity of empirical results. A recently proposed finite mixture for the negative binomial regression model has shown a potential advantage in addressing the unobserved heterogeneity as well as providing useful information about f...
Finite-Sample Equivalence in Statistical Models for Presence-Only Data.
Statistical modeling of presence-only data has attracted much recent attention in the ecological literature, leading to a proliferation of methods, including the inhomogeneous Poisson process (IPP) model, maximum entropy (Maxent) modeling of species distributions and logistic regression models. Several recent articles have shown the close relationships between these methods. We explain why the ...
Model-Based Control of Observer Bias for the Analysis of Presence-Only Data in Ecology
Presence-only data, where information is available concerning species presence but not species absence, are subject to bias due to observers being more likely to visit and record sightings at some locations than others (hereafter "observer bias"). In this paper, we describe and evaluate a model-based approach to accounting for observer bias directly--by modelling presence locations as a functio...
New Technical Efficiency Estimates with Improved Bootstrap Confidence Interval Coverage
Bootstrap confidence intervals on fixed-effects efficiency estimates from finite-sample panel data models exhibit low coverage probabilities, because the traditional estimate involves a "max" operator that induces a finite sample bias. Attempts to bootstrap confidence intervals for the traditional estimate have focused on correcting bias. Rather than addressing this bias at the bootstrap stage,...